MapReduce: Distributed Computing for Machine Learning
نویسندگان
چکیده
We use Hadoop, an open-source implementation of Google’s distributed file system and the MapReduce framework for distributed data processing, on modestly-sized compute clusters to evaluate its efficacy for standard machine learning tasks. We show benchmark performance on searching and sorting tasks to investigate the effects of various system configurations. We also distinguish classes of machine-learning problems that are reasonable to address within the MapReduce framework, and offer improvements to the Hadoop implementation. We conclude that MapReduce is a good choice for basic operations on large datasets, although there are complications to be addressed for more complex machine learning tasks.
منابع مشابه
A Grid Based System for Data Mining Using MapReduce
In this paper, we discuss a Grid data mining system based on the MapReduce paradigm of computing. The MapReduce paradigm emphasizes system automation of fault tolerance and redundancy, while keeping the programming model for the user very simple. MapReduce is built closely on top of a distributed file system, that allows efficient distributed storage of large data sets, and allows computation t...
متن کاملChapter: Distributed Platforms and Cloud Services Enabling Machine Learning for Big Data. An Overview1
Applying popular machine learning algorithms to large amounts of data raised new challenges for machine learning practitioners. Traditional libraries does not support properly the processing of huge data sets, so that new approaches are needed. Using modern distributed computing paradigms, such as MapReduce, or in-memory processing novel machine learning libraries have been devised. In parallel...
متن کاملPLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s...
متن کاملA MapReduce based distributed SVM algorithm for binary classification
Although Support Vector Machine (SVM) algorithm has a high generalization property to classify for unseen examples after training phase and it has small loss value, the algorithm is not suitable for real-life classification and regression problems. SVMs cannot solve hundreds of thousands examples in training dataset. In previous studies on distributed machine learning algorithms, SVM is trained...
متن کاملAdaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and a same workload is assigned to each node. Default Hadoop d...
متن کامل